POLI 572B

Michael Weaver

February 4, 2019

Least Squares: multivariate regression

Plan for Today

Mechanics of Least Squares (Part 2)

  1. Questions on PS 2/ Last Lecture
  2. Review key mathematical properties of least squares
  3. Interpreting bivariate regression
  4. Multivariate regression

Questions

Questions on Problem Set 2

Questions on Least Squares Lecture 1

Reviewing Least Squares

Key insights about least squares:

It generalizes the mean

  • Conditional expectation function
  • To get the mean, regress \(y\) on column of \(1\)s.
  • Least squares lines always passes through point of averages (mean of \(x\), mean of \(y\))

Key insights about least squares:

By design, it minimizes distance between predicted/fitted values \(\hat{y}\) and actual values of \(y\).

  • Distance is minimized when residuals (\(y - \hat{y}\)) are orthogonal/perpendicular to the columns in \(X\) (variables we include)
  • Least squares calculations minimize this distance by design. (hence the name)

Key facts about least squares:

The mathematical procedures we use in least squares ensure that:

\(1\). The mean of the residuals is always zero: \(\overline{e} = 0\). Because we include an intercept (\(a\)) and the regression line passes through the point of averages, the residuals must average to 0. (The same is true of deviations from the mean, the residuals from regressing \(y\) on a column of \(1\)s.)

Why?

We choose \(\begin{pmatrix}a \\ b \end{pmatrix}\) such that \(e\) is orthogonal to \(\mathbf{X}\). One column of \(\mathbf{X}\) is all \(1\)s, to get the intercept (recall how we used vectors to get the mean). So \(e\) is orthogonal to \(\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\).

\[\mathbf{1}'e = 0\]

And if this is true, the \(\sum e_i = 0\) so \(\frac{1}{n}\sum e_i = 0\).

Key facts about least squares:

The mathematical procedures we use in least squares ensure that:

\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares. We chose \(\beta\) (\(a,b\)) such that \(X'e = 0\) so they would be orthogonal. \(X'e = 0 \to \sum x_ie_i = 0 \to \overline{xe}=0\); \(\overline{e}=0\) from above; so \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e} = 0 - \overline{x}0 = 0\).

This also means that residuals \(e\) are always perfectly uncorrelated (in the mathematical sense) with all the columns in our matrix \(\mathbf{X}\): all the variables we include in the regression model.

The Math of Least Squares

In matrix notation:

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}; \beta = \begin{pmatrix} a \\ b \end{pmatrix}\]

\[(X'_{p \times n}X_{n \times p})^{-1}X'_{p \times n}Y_{n \times 1} = \beta_{p \times 1}\]

The Math of Least Squares

Bivariate Regression slope and intercept:

slope:

\[b = \frac{Cov(x,y)}{Var(x)} = r \frac{SD_y}{SD_x}\]

intercept:

\[a = \overline{y} - \overline{x}\cdot b\]
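These formulas are easy to verify in R; a quick sketch with made-up vectors x and y (the \(n-1\) denominators in cov(), var(), and sd() cancel in the ratios):

```r
# Made-up example data
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 6)

# Slope, two equivalent ways
b  = cov(x, y) / var(x)
b2 = cor(x, y) * sd(y) / sd(x)

# Intercept from the point of averages
a = mean(y) - mean(x) * b

# Compare with lm(): same intercept and slope
coef(lm(y ~ x))
```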

The Math of Least Squares

Fitted/Predicted Values

\[\widehat{Y_i} = a + b \cdot x_i\]

\[\widehat{Y}_{n \times 1} = \mathbf{X}_{n \times p}\beta_{p \times 1}\]

\[\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \widehat{Y} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix}\]

The Math of Least Squares

Two “assumptions”

In order for calculations of least squares to be possible, two things must be true:

  1. \(Var(x) \neq 0\)
  • Recall \(b = \frac{Cov(x,y)}{Var(x)}\), so undefined if \(Var(x) = 0\)
  2. \(n \geq p\): at least as many observations as parameters in \(\beta\).
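A sketch of why \(Var(x) = 0\) breaks the calculation: a constant \(x\) column is a multiple of the column of 1s, so \(X'X\) is singular and cannot be inverted (made-up data):

```r
# Constant x: Var(x) = 0
x = rep(3, 5)
y = c(1, 2, 3, 4, 5)
X = cbind(1, x)

# The x column is 3 times the intercept column, so X'X is singular
# and the least squares formula fails
try(solve(t(X) %*% X) %*% t(X) %*% y)
```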

Interpreting Bivariate Least Squares

Doing Least Squares in R

Let’s look at some data:

fheight sheight
65.04851 59.77827
63.25094 63.21404
64.95532 63.34242
65.75250 62.79238
61.13723 64.28113
63.02254 64.24221

Doing Least Squares in R

Doing Least Squares in R

#"X" matrix
X = cbind(1, father.son$fheight)

head(X)
##      [,1]     [,2]
## [1,]    1 65.04851
## [2,]    1 63.25094
## [3,]    1 64.95532
## [4,]    1 65.75250
## [5,]    1 61.13723
## [6,]    1 63.02254
#Y matrix 
Y = father.son$sheight

head(Y)
## [1] 59.77827 63.21404 63.34242 62.79238 64.28113 64.24221

Doing Least Squares in R

beta = solve(t(X) %*% X) %*% t(X) %*% Y

What should the dimension of beta be?

Doing Least Squares in R

beta = solve(t(X) %*% X) %*% t(X) %*% Y

beta
##           [,1]
## [1,] 33.886604
## [2,]  0.514093

Doing Least Squares in R

Typically we use the lm function. It includes an intercept by default

#lm(y ~ x1 + x2 + ... + xn, data = data.frame.with.our.variables)
ls = lm(sheight ~ fheight, data = father.son)

Doing Least Squares in R

Did we get the same results?

summary(ls)$coefficients
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
## fheight      0.514093 0.02704874 19.00618 1.121268e-69
beta
##           [,1]
## [1,] 33.886604
## [2,]  0.514093

Doing Least Squares in R

Do residuals have the expected properties?

y_hat = beta[1,] + beta[2,] * father.son$fheight

#Get residuals
e = father.son$sheight - y_hat

#Are residuals mean 0?
mean(e)
## [1] -5.945586e-12
#Are residuals orthogonal with X?
t(e) %*% X
##               [,1]          [,2]
## [1,] -6.409351e-09 -4.322255e-07
cov(e, X)
##      [,1]        [,2]
## [1,]    0 1.49169e-12

Doing Least Squares in R

Do residuals have the expected properties?

#Do the residuals have 0 covariance with X?
cov(e, X)
##      [,1]        [,2]
## [1,]    0 1.49169e-12

Doing Least Squares in R

Do residuals have the expected properties?

Doing Least Squares in R

We observe what we expect mathematically:

  • least squares chooses \(a\) and \(b\) so that \(\hat{Y}\) is closest to \(Y\) and residuals \(e\) are orthogonal to \(X\).
  • Residuals have mean 0 (up to floating-point rounding error) and 0 covariance with \(X\).

Doing Least Squares in R

Let’s consider this data:

Doing Least Squares in R

Let’s use least squares

#"X" matrix
X = cbind(1, x1)

#Y matrix 
Y = y1

#Get Beta
beta = solve(t(X) %*% X) %*% t(X) %*% Y

Doing Least Squares in R

Let’s use least squares

Doing Least Squares in R

Do residuals have the expected properties?

#Get y hat
y_hat = X %*% beta
#Get residuals
e = Y - y_hat
#Are residuals mean 0?
mean(e)
## [1] -2.29981e-17
#Is covariance of residuals and X == 0?
cov(e, X)
##                  x1
## [1,] 0 2.197894e-15

Doing Least Squares in R

Do residuals have the expected properties?

Doing Least Squares

KEY TAKEAWAY:

Even though residuals are mechanically uncorrelated with \(X\)

  • this only means uncorrelated in the mathematical sense (Pearson correlation).
  • residuals and \(X\) may still be associated in a non-linear way.

Interpreting Least Squares

Before we go further, we need to practice interpretation

Interpreting Least Squares

Interpreting Least Squares

lm(sheight ~ fheight, data = father.son) %>% summary %>% .$coefficients
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
## fheight      0.514093 0.02704874 19.00618 1.121268e-69

Is the intercept substantively meaningful?

How do you interpret the slope?

Interpreting Least Squares

Let’s center father’s height on its mean…

father.son$fheight_centered = father.son$fheight - mean(father.son$fheight)
lm(sheight ~ fheight_centered, data = father.son) %>% summary %>% .$coefficients
##                   Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)      68.684070 0.07421078 925.52689 0.000000e+00
## fheight_centered  0.514093 0.02704874  19.00618 1.121268e-69

Is the intercept substantively meaningful?

How do you interpret the slope?

Interpreting Least Squares

Sometimes people talk about \(R^2\)

  • This is supposed to capture “goodness-of-fit”/“explained variance”

  • But this is a concept that is not about causality, just about how well variance in \(Y\) can be “predicted” by \(X\).

Interpreting Least Squares

“Explained Variance”

total sum of squares (TSS): proportional to the variance of the outcome: \(\sum_i (Y_i - \bar{Y})^2\)

residual sum of squares (RSS): proportional to the variance of the residuals: \(\sum_i (Y_i - \hat{Y_i})^2\)

model sum of squares (MSS): proportional to the variance of the fitted values: \(\sum_i (\hat{Y_i} - \bar{Y})^2\)

R-squared: fraction of variance in \(Y\) “explained” by the model. (\(R^2 = 1 - \frac{RSS}{TSS} = \frac{MSS}{TSS}\))
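These pieces are easy to compute by hand. A sketch using R's built-in mtcars data (not the lecture data); any \(\frac{1}{n}\) terms cancel in the ratios, so sums of squares suffice:

```r
fit   = lm(mpg ~ wt, data = mtcars)
Y     = mtcars$mpg
Y_hat = fitted(fit)

TSS = sum((Y - mean(Y))^2)      # total sum of squares
RSS = sum((Y - Y_hat)^2)        # residual sum of squares
MSS = sum((Y_hat - mean(Y))^2)  # model sum of squares

# TSS = MSS + RSS, and both expressions for R^2 match summary()
c(1 - RSS / TSS, MSS / TSS, summary(fit)$r.squared)
```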

Interpreting Least Squares

“Explained Variance”

  • Summarizes the fit between the data and the regression equation (of the line)
  • High \(R^2\) indicates a good fit: the residuals are small relative to the standard deviation of \(Y\).
  • but \(R^2\) does not have the units of a “typical residual distance”: it is based on squared residuals and scaled to lie between 0 and 1.
  • Not about causality. “Explained” is a misnomer.

Multivariate Least Squares

Multivariate Least Squares:

Previously we fit the mean of \(Y\) as a linear function of \(X\):

\[Y_i = a + b \cdot X_i\]

Now, we can imagine fitting mean of \(Y\) as a linear function of many variables:

\[Y_i = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]

Multivariate Least Squares:

  • When we calculated the mean using matrix algebra, we projected the \(n\) dimensional vector \(Y\) onto a point on a one-dimensional line.
  • When we calculated the bivariate regression line, we projected the \(n\) dimensional vector \(Y\) onto a \(2\)-dimensional space (one for \(a\) and one for \(b\))
  • When we use multi-variate regression, we project the \(n\) dimensional vector \(Y\) onto a \(p\) dimensional space (one for each parameter/coefficient)

Multivariate Least Squares:

Thinking about “projecting onto \(p\) dimensions” can be helpful:

When we project into two dimensions, these dimensions are like the \(x\) and \(y\) axes on a graph: perpendicular to each other. This is not an analogy; it is exactly what we do with multi-variate regression, because we are going to project \(Y\) onto the orthogonal components in \(X\).

Mathematical Requirements:

  1. Matrix \(X\) has “full rank”
  • This means that all of the columns of \(X\) are linearly independent.
    • cannot have two identical columns
    • no column can equal a linear combination of the other columns (e.g. a set of columns summing to another column times a scalar)
  • If \(X\) is not full rank, \(X'X\) cannot be inverted, and we cannot do least squares.
    • but we’ll see more on the intuition for this later.
  2. \(n \geq p\): we need at least as many data points as variables in our equation
  • no longer trivial with multiple regression

Multivariate Least Squares:

Examples: Linear Dependence

\[\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}\]

Multivariate Least Squares:

Examples: Linear Dependence

\[\begin{pmatrix} 1 & 2 & 0 & 0 \\ 1 & 0 & 2 & 0 \\ 1 & 0 & 0 & 2 \end{pmatrix}\]

Multivariate Least Squares:

Examples: Linear Dependence

\[\begin{pmatrix} 1 & 0.50 & 0.25 & 0.25 \\ 1 & 0.25 & 0.50 & 0.25 \\ 1 & 0.25 & 0.25 & 0.50 \end{pmatrix}\]
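In each example above, columns 2 through 4 sum to column 1 times a scalar, so the columns are linearly dependent (and with only 3 rows, \(n < p\) rules out full column rank anyway). A quick check in R:

```r
# The three example matrices, entered column by column
X1 = cbind(1, c(1, 0, 0), c(0, 1, 0), c(0, 0, 1))
X2 = cbind(1, c(2, 0, 0), c(0, 2, 0), c(0, 0, 2))
X3 = cbind(1, c(.50, .25, .25), c(.25, .50, .25), c(.25, .25, .50))

# Columns 2:4 of X1 sum to column 1 exactly
X1[, 2] + X1[, 3] + X1[, 4]

# None has full column rank (rank 3, but 4 columns)
sapply(list(X1, X2, X3), function(X) qr(X)$rank)
```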

Multivariate Least Squares:

When we include more than one variable in the equation, we cannot calculate slopes using simple algebraic expressions like \(\frac{Cov(X,Y)}{Var(X)}\).

  • Must use matrix algebra (this is why I introduced it)

We calculate least squares using same matrix equation (\((X'X)^{-1}X'Y\)) as in bivariate regression, but what is the math doing in the multivariate case?

Multivariate Least Squares:

When fitting the equation:

\(Y = b_0 + b_1x + b_2z\)

  1. \(b_1 = \frac{Cov(x^*, Y)}{Var(x^*)}\)

Where \(x^* = x - \hat{x}\) from the regression: \(x = c_0 + c_1 z\).

  2. \(b_2 = \frac{Cov(z^*, Y)}{Var(z^*)}\)

Where \(z^* = z - \hat{z}\) from the regression: \(z = d_0 + d_1 x\)

Does anything look familiar here?
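This “regression on residuals” recipe can be checked directly; a sketch using R's built-in mtcars data (not the lecture data), with wt playing the role of \(x\) and hp the role of \(z\):

```r
# Multivariate fit: mpg on wt and hp
fit = lm(mpg ~ wt + hp, data = mtcars)

# x*: the part of wt orthogonal to hp (residuals from regressing wt on hp)
x_star = resid(lm(wt ~ hp, data = mtcars))

# Bivariate slope formula applied to the residualized variable
b1 = cov(x_star, mtcars$mpg) / var(x_star)

# Identical to the multivariate coefficient on wt
c(b1, coef(fit)["wt"])
```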

Multivariate Least Squares:

More generally:

\[Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]

\(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)

where \(X_k^* = X_k - \hat{X_k}\) obtained from the regression:

\(X_{k} = c_0 + c_1 X_{1} + \ldots + c_{k-1} X_{k-1}\)

\(X_k^*\) is the residual from regressing \(X_k\) on all other \(X_{j \neq k}\)

Multivariate Least Squares:

How do we make sense of \(X_k^*\), the residual of \(X_k\) after regressing it on all other \(X_{j \neq k}\)?

  • It is a residual in the same way as \(e\): \(X_k^*\) is orthogonal to all the other variables \(X_{j \neq k}\).
  • Like we said above, it is “perpendicular” to other variables, like axes on the graph.
  • It is mechanically uncorrelated (in the linear sense) with all other variables in the regression.

Multivariate Least Squares:

How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\) (where \(X_k^*\) is the residual of \(X_k\) after regressing on all other \(X_{j \neq k}\))?

  • The slope \(b_k\) is the change in \(Y\) with a one-unit change in the part of \(X_k\) that is uncorrelated with/orthogonal to the other variables in the regression.

  • Sometimes people say “the slope of \(X_k\) controlling for variables \(X_{j \neq k}\)”.
  • is it “holding other factors constant”/ceteris paribus? Not quite
  • better to think of it as “partialling out” the relationship with other variables in \(\mathbf{X}\).

Multivariate Least Squares:

There are additional implications of defining the slope \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\):

Now we can see why columns of \(X\) must be linearly independent:

  • e.g. if \(X_1\) were linearly dependent on \(X_2\) and \(X_3\), then \(X_2\) and \(X_3\) perfectly predict \(X_1\).
  • If \(X_1\) is perfectly predicted by \(X_2\) and \(X_3\), then the residuals \(X_1^*\) will all be \(0\).
  • If \(X_1^*\) are all \(0\)s, then \(Var(X_1^*) = 0\), and \(b_1\) is undefined.
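In practice, R's lm() detects this, drops the offending column, and reports NA for its coefficient; a sketch with made-up, perfectly collinear regressors:

```r
set.seed(1)
x2 = rnorm(50)
x3 = rnorm(50)
x1 = x2 + x3               # x1 is an exact linear combination of x2 and x3
y  = x2 - x3 + rnorm(50)

# lm() drops one collinear column and returns NA for its coefficient
coef(lm(y ~ x1 + x2 + x3))
```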

Example

Let’s say we want to estimate the mean earnings for professionals as a function of hours worked, profession, gender, and age.

Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers. Here we look at UHRSWORK (the usual hours worked per week), SEX (an indicator for female), AGE (in years), LAW (an indicator for being a lawyer), and INCEARN (total annual income earned in dollars). We restrict our sample to full-time workers (over 35 hours per week and 40 weeks per year).

Example

Let’s now estimate the relationship between hours worked and earnings, conditioning on age, profession, and gender:

ls = lm(INCEARN ~ UHRSWORK + AGE + LAW + SEX, acs_data)

Example

summary(ls)
## 
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + SEX, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -282796  -89635  -29132   70136  772441 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    847.0     9332.0   0.091    0.928    
## UHRSWORK      1251.9      113.5  11.030   <2e-16 ***
## AGE           3473.9      124.1  27.991   <2e-16 ***
## LAW         -47539.3     2629.5 -18.079   <2e-16 ***
## SEX         -39363.2     2786.2 -14.128   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 126900 on 9995 degrees of freedom
## Multiple R-squared:  0.149,  Adjusted R-squared:  0.1487 
## F-statistic: 437.5 on 4 and 9995 DF,  p-value: < 2.2e-16


Multivariate Least Squares:

Key Takeaways:

  • Multivariate least squares generalizes the bivariate model directly (in matrix form)
  • Mathematically, least squares returns the slope of the relationship between the mean of \(Y\) and each variable, residualized on the other variables in the equation.
  • Many people interpret this residualizing as “controlling” or “holding constant”, but it is not quite that simple
  • To what degree is “controlling for” or “holding constant” other factors the same as the balance in potential outcomes delivered by randomization?

Multivariate Least Squares:

Dummy Variables

What are they?

Binary variables taking \(1\) if the observation has some attribute, \(0\) otherwise

How do we make them?

  • Easy to do in R with as.factor()

lm(y ~ as.factor(category_var))

Dummy Variables

What do they do?

  • They estimate a new intercept (mean) for some group in the data. (Why is this the case?)

Dummy Variables | 2 categories

summary(lm(INCEARN ~ sex, acs_data))$coefficients
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 143878.05   2358.730 60.99810 0.000000e+00
## sexMale      58918.14   2873.297 20.50541 1.450084e-91

Dummy Variables | 2 categories

  • We can’t have all categories and the intercept.
  • Why?
  • For all observations: Female + Male = 1. We estimate intercept by including a column of 1s. These are linearly dependent, so we can’t include all of them.
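A sketch of this dummy variable trap with made-up indicators: the intercept column equals Female + Male, so the design matrix is not full rank and \(X'X\) cannot be inverted:

```r
female = c(1, 0, 1, 0, 1)
male   = 1 - female
X = cbind(intercept = 1, female, male)

qr(X)$rank              # 2, not 3: the columns are linearly dependent
try(solve(t(X) %*% X))  # X'X is singular
```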

Dummy Variables | 2 categories

We can include all categories and drop the intercept

summary(lm(INCEARN ~ -1 + sex, acs_data))$coefficients
##           Estimate Std. Error  t value Pr(>|t|)
## sexFemale 143878.0   2358.730  60.9981        0
## sexMale   202796.2   1640.801 123.5958        0

Dummy Variables | 2 categories

How does our interpretation change?

##           Estimate Std. Error  t value Pr(>|t|)
## sexFemale 143878.0   2358.730  60.9981        0
## sexMale   202796.2   1640.801 123.5958        0

Dummy Variables | 5 categories

We can consider a case where there are several categories:

Brexit Vote: What are regional differences in vote for Brexit?

Five regions: Southern England, Midlands, Northern England, Wales, and Scotland

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ Pct_Trn +  Region5, brexit))$coefficients
##                             Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)              44.61020359  7.7829764  5.7317665 2.048832e-08
## Pct_Trn                   0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5The Midlands       8.57606186  1.2532253  6.8431923 3.179826e-11
## Region5Northern England   6.35984691  1.3248565  4.8004045 2.293393e-06
## Region5Wales              2.30867033  2.0441434  1.1294072 2.594499e-01
## Region5Scotland         -11.60364022  1.8478179 -6.2796448 9.422928e-10

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
##                             Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)              53.18626546  7.7855372   6.8314188 3.420699e-11
## Pct_Trn                   0.08933588  0.1027488   0.8694588 3.851539e-01
## Region5Northern England  -2.21621496  1.5559745  -1.4243260 1.551859e-01
## Region5Wales             -6.26739154  2.2030506  -2.8448696 4.687357e-03
## Region5Scotland         -20.17970209  2.0149071 -10.0152022 4.494699e-21
## Region5Southern England  -8.57606186  1.2532253  -6.8431923 3.179826e-11

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
##                             Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)              50.97005050  7.3635671  6.9219239 1.946734e-11
## Pct_Trn                   0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5Wales             -4.05117658  2.1752891 -1.8623624 6.333598e-02
## Region5Scotland         -17.96348713  1.9097366 -9.4062643 5.303722e-19
## Region5Southern England  -6.35984691  1.3248565 -4.8004045 2.293393e-06
## Region5The Midlands       2.21621496  1.5559745  1.4243260 1.551859e-01

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
##                             Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)              46.91887392  7.6346802  6.1454930 2.043370e-09
## Pct_Trn                   0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5Scotland         -13.91231055  2.4939139 -5.5785047 4.660849e-08
## Region5Southern England  -2.30867033  2.0441434 -1.1294072 2.594499e-01
## Region5The Midlands       6.26739154  2.2030506  2.8448696 4.687357e-03
## Region5Northern England   4.05117658  2.1752891  1.8623624 6.333598e-02

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ Pct_Trn +  Region5, brexit))$coefficients
##                            Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)             33.00656337  7.2248567  4.5684731 6.681290e-06
## Pct_Trn                  0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5Southern England 11.60364022  1.8478179  6.2796448 9.422928e-10
## Region5The Midlands     20.17970209  2.0149071 10.0152022 4.494699e-21
## Region5Northern England 17.96348713  1.9097366  9.4062643 5.303722e-19
## Region5Wales            13.91231055  2.4939139  5.5785047 4.660849e-08

Dummy Variables | 5 categories

How do you interpret this?

summary(lm(Pct_Lev ~ -1 + Pct_Trn +  Region5, brexit))$coefficients
##                            Estimate Std. Error   t value     Pr(>|t|)
## Pct_Trn                  0.08933588  0.1027488 0.8694588 3.851539e-01
## Region5Scotland         33.00656337  7.2248567 4.5684731 6.681290e-06
## Region5Southern England 44.61020359  7.7829764 5.7317665 2.048832e-08
## Region5The Midlands     53.18626546  7.7855372 6.8314188 3.420699e-11
## Region5Northern England 50.97005050  7.3635671 6.9219239 1.946734e-11
## Region5Wales            46.91887392  7.6346802 6.1454930 2.043370e-09

Dummy Variables | 5 categories

In R: you can shift the reference category using the following:

levels = c('Scotland', 'Southern England', 'The Midlands', 'Northern England', 'Wales')
brexit$Region5 = factor(brexit$Region5, levels = levels)

With the intercept included, the first level becomes the reference category and its dummy is dropped (in this case, Scotland)